DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters
Abstract
When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so that preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss or, even worse, reliability degradation of a datacenter. We further propose a two-stage framework, DC-Prophet, based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and an F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.
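For readers unfamiliar with the F3-score reported above, the sketch below computes the general F-beta score from precision and recall, using the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). With beta = 3, recall is weighted more heavily than precision, which suits failure prediction, where a missed failure is usually costlier than a false alarm. The function name `f_beta` and the sample values are illustrative assumptions, not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Return the F-beta score for the given precision and recall.

    Hypothetical helper for illustration; beta=3 gives the F3-score.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denom


# Illustrative values only (not results from the paper):
# high recall with modest precision scores much better under F3
# than high precision with modest recall.
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)
print(f"F3 with P=0.5, R=1.0: {high_recall:.3f}")
print(f"F3 with P=1.0, R=0.5: {high_precision:.3f}")
```

The asymmetry between the two printed values shows why a recall-weighted metric is a natural choice when the goal is to catch as many upcoming failures as possible.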
Similar resources
Communication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology
With the pervasiveness of cloud computing, a colossal number of applications from very large organizations increasingly rely on cloud services. This demand has caused a great number of applications, in the form of requests for a couple of virtual machines (VMs), to be executed on data centers’ servers. Some applications are too big to be processed on a single VM. Also, there exists severa...
Transfer Learning-Based Co-Run Scheduling for Heterogeneous Datacenters
Today’s data centers are designed with multi-core CPUs, where multiple virtual machines (VMs) can be co-located on one physical machine or multiple computing tasks can be distributed onto one physical machine. The result is co-tenancy, resource sharing and competition. Modeling and predicting such co-run interference becomes crucial for job scheduling and Quality of Service assurance. Co-locating inter...
VM Consolidation by using Selection and Placement of VMs in Cloud Datacenters
The Cloud Computing model leverages virtualization of computing resources, allowing customers to provision resources on demand on a pay-as-you-go basis. In recent years, the power consumption of datacenters in cloud environments has attracted researchers' attention. Optimization of energy consumption can be performed by different methods, including virtual machine (VM) consolidation. This technique can reduc...
Communication-efficient Outlier Detection for Scale-out Systems
Modern scale-out services are built on top of large datacenters composed of thousands of individual machines. These must be continuously monitored because unexpected failures can overload fail-over mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, often caused by misconfigurations, har...
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
Google’s Ads Data Infrastructure systems run the multibillion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Mos...